Transformers documentation
CodeLlama
CodeLlama
Code Llama is a specialized family of large language models based on Llama 2 for coding tasks. It comes in different flavors - general code, Python-specific, and instruction-following variant - all available in 7B, 13B, 34B, and 70B parameters. Code Llama models can generate, explain, and even fill in missing parts of your code (called “infilling”). It can also handle very long contexts with stable generation up to 100k tokens, even though it was trained on sequences of 16K tokens.
You can find all the original Code Llama checkpoints under the Code Llama collection.
Click on the Code Llama models in the right sidebar for more examples of how to apply Code Llama to different coding tasks.
The example below demonstrates how to generate code with Pipeline, or the AutoModel, and from the command line.
import torch
from transformers import pipeline
pipe = pipeline(
"text-generation",
model="meta-llama/CodeLlama-7b-hf",
torch_dtype=torch.float16,
device_map=0
)
# basic code generation
result = pipe("# Function to calculate the factorial of a number\ndef factorial(n):", max_new_tokens=256)
print(result[0]['generated_text'])
# infilling
infill_result = pipe("def remove_non_ascii(s: str) -> str:\n \"\"\" <FILL_ME>\n return result", max_new_tokens=200)
print(infill_result[0]['generated_text'])
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses bitsandbytes to only quantize the weights to 4-bits.
# pip install bitsandbytes
import torch
from transformers import AutoModelForCausalLM, CodeLlamaTokenizer, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
tokenizer = CodeLlamaTokenizer.from_pretrained("meta-llama/CodeLlama-34b-hf")
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/CodeLlama-34b-hf",
torch_dtype=torch.bfloat16,
device_map="auto",
quantization_config=bnb_config
)
prompt = "# Write a Python function to check if a string is a palindrome\ndef is_palindrome(s):"
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**input_ids, max_new_tokens=200, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
Use the AttentionMaskVisualizer to better understand what tokens the model can and cannot attend to.
from transformers.utils.attention_visualizer import AttentionMaskVisualizer
visualizer = AttentionMaskVisualizer("meta-llama/CodeLlama-7b-hf")
visualizer("""def func(a, b):
return a + b""")

Notes
- Infilling is only available in the 7B and 13B base models, and not in the Python, Instruct, 34B, or 70B models.
- Use the
<FILL_ME>
token where you want your input to be filled. The tokenizer splits this token to create a formatted input string that follows the original training pattern. This is more robust than preparing the pattern yourself.from transformers import LlamaForCausalLM, CodeLlamaTokenizer tokenizer = CodeLlamaTokenizer.from_pretrained("meta-llama/CodeLlama-7b-hf") model = LlamaForCausalLM.from_pretrained("meta-llama/CodeLlama-7b-hf") PROMPT = '''def remove_non_ascii(s: str) -> str: """ <FILL_ME> return result ''' input_ids = tokenizer(PROMPT, return_tensors="pt")["input_ids"] generated_ids = model.generate(input_ids, max_new_tokens=128) filling = tokenizer.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens = True)[0] print(PROMPT.replace("<FILL_ME>", filling))
- Use
bfloat16
for further training or fine-tuning andfloat16
for inference. - The
BOS
character is not used for infilling when encoding the prefix or suffix, but only at the beginning of each prompt. - The tokenizer is a byte-pair encoding model based on SentencePiece. During decoding, if the first token is the start of the word (for example, “Banana”), the tokenizer doesn’t prepend the prefix space to the string.
CodeLlamaTokenizer
class transformers.CodeLlamaTokenizer
< source >( vocab_file unk_token = '<unk>' bos_token = '<s>' eos_token = '</s>' prefix_token = '▁<PRE>' middle_token = '▁<MID>' suffix_token = '▁<SUF>' eot_token = '▁<EOT>' fill_token = '<FILL_ME>' suffix_first = False sp_model_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None add_bos_token = True add_eos_token = False clean_up_tokenization_spaces = False additional_special_tokens = None use_default_system_prompt = False **kwargs )
Parameters
- vocab_file (
str
) — Path to the vocabulary file. - unk_token (
str
, optional, defaults to"<unk>"
) — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. - bos_token (
str
, optional, defaults to"<s>"
) — The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - eos_token (
str
, optional, defaults to"</s>"
) — The end of sequence token.When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the
sep_token
. - prefix_token (
str
, optional, defaults to"▁<PRE>"
) — Prefix token used for infilling. - middle_token (
str
, optional, defaults to"▁<MID>"
) — Middle token used for infilling. - suffix_token (
str
, optional, defaults to"▁<SUF>"
) — Suffix token used for infilling. - eot_token (
str
, optional, defaults to"▁<EOT>"
) — End of text token used for infilling. - fill_token (
str
, optional, defaults to"<FILL_ME>"
) — The token used to split the input between the prefix and suffix. - suffix_first (
bool
, optional, defaults toFalse
) — Whether the input prompt and suffix should be formatted with the suffix first. - sp_model_kwargs (
dict
, optional) — Will be passed to theSentencePieceProcessor.__init__()
method. The Python wrapper for SentencePiece can be used, among other things, to set:-
enable_sampling
: Enable subword regularization. -
nbest_size
: Sampling parameters for unigram. Invalid for BPE-Dropout.nbest_size = {0,1}
: No sampling is performed.nbest_size > 1
: samples from the nbest_size results.nbest_size < 0
: assuming that nbest_size is infinite and samples from the all hypothesis (lattice) using forward-filtering-and-backward-sampling algorithm.
-
alpha
: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.
-
- add_bos_token (
bool
, optional, defaults toTrue
) — Whether to add a beginning of sequence token at the start of sequences. - add_eos_token (
bool
, optional, defaults toFalse
) — Whether to add an end of sequence token at the end of sequences. - clean_up_tokenization_spaces (
bool
, optional, defaults toFalse
) — Whether or not to clean up the tokenization spaces. - additional_special_tokens (
List[str]
, optional) — Additional special tokens used by the tokenizer. - use_default_system_prompt (
bool
, optional, defaults toFalse
) — Whether or not the default system prompt for Llama should be used.
Construct a CodeLlama tokenizer. Based on byte-level Byte-Pair-Encoding. The default padding token is unset as there is no padding token in the original model.
The default configuration match that of codellama/CodeLlama-7b-Instruct-hf which supports prompt infilling.
get_special_tokens_mask
< source >( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None already_has_special_tokens: bool = False ) → List[int]
Parameters
- token_ids_0 (
List[int]
) — List of IDs. - token_ids_1 (
List[int]
, optional) — Optional second list of IDs for sequence pairs. - already_has_special_tokens (
bool
, optional, defaults toFalse
) — Whether or not the token list is already formatted with special tokens for the model.
Returns
List[int]
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer prepare_for_model
method.
create_token_type_ids_from_sequences
< source >( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]
Parameters
- token_ids_0 (
List[int]
) — List of ids. - token_ids_1 (
List[int]
, optional) — Optional second list of IDs for sequence pairs.
Returns
List[int]
List of token type IDs according to the given sequence(s).
Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT
sequence pair mask has the following format:
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
if token_ids_1 is None, only returns the first portion of the mask (0s).
save_vocabulary
< source >( save_directory filename_prefix: typing.Optional[str] = None ) → Tuple(str)
Save the vocabulary and special tokens file to a directory.
CodeLlamaTokenizerFast
class transformers.CodeLlamaTokenizerFast
< source >( vocab_file = None tokenizer_file = None clean_up_tokenization_spaces = False unk_token = '<unk>' bos_token = '<s>' eos_token = '</s>' prefix_token = '▁<PRE>' middle_token = '▁<MID>' suffix_token = '▁<SUF>' eot_token = '▁<EOT>' fill_token = '<FILL_ME>' additional_special_tokens = None add_bos_token = True add_eos_token = False use_default_system_prompt = False **kwargs )
Parameters
- vocab_file (
str
, optional) — SentencePiece file (generally has a .model extension) that contains the vocabulary necessary to instantiate a tokenizer. - tokenizer_file (
str
, optional) — tokenizers file (generally has a .json extension) that contains everything needed to load the tokenizer. - clean_up_tokenization_spaces (
str
, optional, defaults toFalse
) — Whether to cleanup spaces after decoding, cleanup consists in removing potential artifacts like extra spaces. - unk_token (
str
, optional, defaults to"<unk>"
) — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. - bos_token (
str
, optional, defaults to"<s>"
) — The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. - eos_token (
str
, optional, defaults to"</s>"
) — The end of sequence token. - prefix_token (
str
, optional, defaults to"▁<PRE>"
) — Prefix token used for infilling. - middle_token (
str
, optional, defaults to"▁<MID>"
) — Middle token used for infilling. - suffix_token (
str
, optional, defaults to"▁<SUF>"
) — Suffix token used for infilling. - eot_token (
str
, optional, defaults to"▁<EOT>"
) — End of text token used for infilling. - fill_token (
str
, optional, defaults to"<FILL_ME>"
) — The token used to split the input between the prefix and suffix. - additional_special_tokens (
List[str]
, optional) — Additional special tokens used by the tokenizer. - add_bos_token (
bool
, optional, defaults toTrue
) — Whether to add a beginning of sequence token at the start of sequences. - add_eos_token (
bool
, optional, defaults toFalse
) — Whether to add an end of sequence token at the end of sequences. - use_default_system_prompt (
bool
, optional, defaults toFalse
) — Whether or not the default system prompt for Llama should be used.
Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.
This uses notably ByteFallback and no normalization.
>>> from transformers import CodeLlamaTokenizerFast
>>> tokenizer = CodeLlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
>>> tokenizer.encode("Hello this is a test")
[1, 15043, 445, 338, 263, 1243]
If you want to change the bos_token
or the eos_token
, make sure to specify them when initializing the model, or
call tokenizer.update_post_processor()
to make sure that the post-processing is correctly done (otherwise the
values of the first token and final token of an encoded sequence will not be correct). For more details, checkout
[post-processors] (https://huggingface.co/docs/tokenizers/api/post-processors) documentation.
This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. The default configuration match that of meta-llama/CodeLlama-7b-Instruct-hf which supports prompt infilling.
build_inputs_with_special_tokens
< source >( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]
Parameters
- token_ids_0 (
List[int]
) — List of IDs to which the special tokens will be added. - token_ids_1 (
List[int]
, optional) — Optional second list of IDs for sequence pairs.
Returns
List[int]
list of input IDs with the appropriate special tokens.
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. The special tokens depend on calling set_lang.
An NLLB sequence has the following format, where X
represents the sequence:
input_ids
(for encoder)X [eos, src_lang_code]
decoder_input_ids
: (for decoder)X [eos, tgt_lang_code]
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
get_special_tokens_mask
< source >( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None already_has_special_tokens: bool = False ) → A list of integers in the range [0, 1]
Parameters
- token_ids_0 (
List[int]
) — List of ids of the first sequence. - token_ids_1 (
List[int]
, optional) — List of ids of the second sequence. - already_has_special_tokens (
bool
, optional, defaults toFalse
) — Whether or not the token list is already formatted with special tokens for the model.
Returns
A list of integers in the range [0, 1]
1 for a special token, 0 for a sequence token.
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer prepare_for_model
or encode_plus
methods.
create_token_type_ids_from_sequences
< source >( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]
Create the token type IDs corresponding to the sequences passed. What are token type IDs?
Should be overridden in a subclass if the model has a special way of building those.
Updates the underlying post processor with the current bos_token
and eos_token
.